This section presents the analysis of research resources using Research Resource Identifiers (RRIDs).
Before diving into the actual analysis, exploratory data analysis was done to understand the structure of the data: the file format, the number of journals, the number of publications, etc. This step proved important for optimizing the code and for creating the parallel processes that speed up the RRID extraction in the second step.
In the second step, several factors were taken into account to formulate a regular expression (regex). The regex is then used to extract RRID keys from 3.2 million publications (available in XML format). The extraction was done using multi-processing with 32 parallel processes. The intermediate result of the RRID extraction is stored in a nested dictionary as shown below:
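Based on the extraction code later in this section (each journal maps to a dictionary of XML files, and each XML file maps to the list of RRID strings matched in it), the nested dictionary has roughly the following shape. The journal names, file names, and RRID keys below are illustrative placeholders, not actual data:

```python
# Illustrative structure only -- journal names, file names, and RRID keys are made up.
journal_xml_rrid = {
    "eLife": {                          # journal directory name
        "PMC7000001.nxml": [            # one XML publication in that journal
            "RRID:AB_2313606",          # RRID citations matched in the file
            "RRID:SCR_002798",
        ],
        "PMC7000002.nxml": ["RRID:CVCL_0063"],
    },
    "Zygote": {
        "PMC7000003.nxml": ["RRID:IMSR_JAX:000664"],
    },
}

# With this shape, per-journal citation counts fall out of a small comprehension:
counts = {j: sum(len(v) for v in files.values()) for j, files in journal_xml_rrid.items()}
```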
The dictionary is then saved as a JSON file. Further information about the publications, such as the electronic publication date (epub) and the article title, has also been retrieved from the XML files and saved as a CSV file for analysis. After that, RRID analysis and visualization were done using the extracted RRIDs (in the JSON and CSV files). Finally, the findings of the data analysis are summarized and recommendations are made for future work.
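Retrieving the title and epub date can be sketched as follows. The tag paths assume the usual JATS layout of PubMed `.nxml` files (`article-meta/title-group/article-title`, `pub-date[@pub-type='epub']`); the inline sample document is invented for demonstration:

```python
import xml.etree.ElementTree as ET

# Minimal made-up JATS-style fragment standing in for a real .nxml file.
sample = """<article>
  <front><article-meta>
    <title-group><article-title>A made-up title</article-title></title-group>
    <pub-date pub-type="epub"><day>5</day><month>3</month><year>2019</year></pub-date>
  </article-meta></front>
</article>"""

root = ET.fromstring(sample)
# Article title text
title = root.findtext(".//article-meta/title-group/article-title")
# Electronic publication date, assembled year-month-day
epub = root.find(".//pub-date[@pub-type='epub']")
epub_date = "-".join(epub.findtext(tag) for tag in ("year", "month", "day"))
print(title, epub_date)  # -> A made-up title 2019-3-5
```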
The sequence of steps used in this project is shown below:

Note:
The analysis in this project is carried out in a procedural manner so that the results shown in this work are reproducible.
The publications are available at /mnt/nvs3/nlp-public/Pubmed/pubmed_xml/ on the elaine server.
Exploratory Data Analysis (EDA) is an important step in the initial phase of data analysis: it helps to understand the structure of the data and gives a glimpse of its main features, such as the data format, the directory layout, etc. [37]
# import essential libraries
import os
from os import listdir
import simplejson as sj
import json
import xml.etree.ElementTree as ET
import concurrent.futures
import time
import re
import operator
import pandas as pd
#Create list of pmed_directories
pmed_directory_list = listdir("/mnt/storage1/nlp/pubmed_xml/")
pmed_directory_list.sort()
print(f"We have {len(pmed_directory_list)} Journals in PubMed.\n")
# sample PubMed Journal directories
print ("Sample Journal names:")
pmed_directory_list[:5]
import os
# list all files in a sample journal directory
xml_sample_file_path = "/mnt/storage1/nlp/pubmed_xml/Zygote"
with os.scandir(xml_sample_file_path) as entries:
    for entry in entries:
        if entry.is_file():
            print(f'file format: {entry.name}')
xml_list = []
path = "/mnt/storage1/nlp/pubmed_xml/"
for dir_ in pmed_directory_list:
    xml_path = path + dir_  # path to the directory that contains the XML files
    if os.path.isdir(xml_path):  # check that the directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue  # skip non-XML files
            xml_list.append(xml_file)
print(f"Total Number of publications: {len(xml_list)}")
To find the biggest journals, i.e. those with the largest number of publications, a mapper_function is defined below.
# define a function that maps each publication to its Journal
def mapper_function(dir_):
    path = "/mnt/storage1/nlp/pubmed_xml/"
    map_dict = {}
    # create a path to the journal directory
    xml_path = path + dir_
    xml_list = []
    if os.path.isdir(xml_path):  # check that the directory exists
        for xml_file in os.listdir(xml_path):
            # skip non-XML files
            if not xml_file.endswith('.nxml'): continue
            # list of publications in the journal
            xml_list.append(xml_file)
        # store in the dictionary
        map_dict[dir_] = xml_list
    return map_dict
# map each XML publication to its parent Journal using ProcessPoolExecutor
import concurrent.futures
Journal_xml_mapping_dict = {}
with concurrent.futures.ProcessPoolExecutor() as executor:
    # submit the journals in batches of 32 processes
    for i in range(0, len(pmed_directory_list), 32):
        result_list = []
        for dir_rk in pmed_directory_list[i:i+32]:
            results = executor.submit(mapper_function, dir_rk)  # returns a directory-to-publication mapping
            result_list.append(results)  # store the pending futures in the list
        for f in concurrent.futures.as_completed(result_list):
            Journal_xml_mapping_dict.update(f.result())
# save Journal_xml_mapping_dict as a JSON file
# save the number of publications in each Journal as journal_size_dict
journal_size_dict = {}
for journal in list(Journal_xml_mapping_dict):
    journal_size_dict[journal] = len(Journal_xml_mapping_dict.get(journal))
# save journal_size_dict as a JSON file
# sort journals in descending order by publication count
Journal_size_desceding_order = dict(sorted(journal_size_dict.items(), key=lambda item: item[1], reverse=True))
# take the top 10 Journals by publication count
Top_10_RRID_Journals_desc = dict(sorted(Journal_size_desceding_order.items(), key=operator.itemgetter(1), reverse=True)[:10])
# Visualization of top 10 journals by publication count
import matplotlib.pyplot as plt
# plot color style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14,7))
# plot bars
ax.bar(list(Top_10_RRID_Journals_desc.keys()), list(Top_10_RRID_Journals_desc.values()))
# annotate the plot
ax.set_ylabel("Number of Publications")
ax.set_xlabel("Journals")
ax.set_title("Top 10 Journals by no. of publications")
ax.set_xticklabels(list(Top_10_RRID_Journals_desc.keys()), rotation=15)
fig.savefig("Plot: 1.Top 10 Journals by no. of publications.png", dpi = 130)
plt.show()
Using journal_size_dict from the result above, the distribution of publication counts across journals is visualized as a bubble chart:
Note: Only journals with more than 2,000 publications have been visualized below.
import pandas as pd
import numpy as np
np.random.seed(42)
N = 15167
# generate random x and y to place the bubble in the x-y plane
x = np.random.normal(170, 20, N)
y = x + np.random.normal(5, 25, N)
colors = np.random.rand(N)
Journals = np.array(list(journal_size_dict.keys()))
count = np.array(list(journal_size_dict.values()))
# create a data frame
df = pd.DataFrame({'X': x,'Y': y,'colors':colors,'Journals': Journals,"Number of Publications":count})
# subset journals with more than 2k publications
df_2k = df[df['Number of Publications'] > 2000 ]
df_2k
# visualize the result
import plotly.express as px
import plotly.graph_objects as go
fig = px.scatter(df_2k, x="X", y="Y",size="Number of Publications",hover_name="Journals",log_x=True, size_max=60)
fig.show()
In this project, Research Resource Identifiers (RRIDs) have been extracted using regular expressions (regex).
Regular expressions are very powerful for extracting information from text files, particularly when the text has a specific pattern that can be generalized. Since RRID keys have a unique syntax of the form RRID:prefix_Identifier, regular expressions can effectively retrieve all RRID citations from publications, as long as the regex can handle all variants of RRID resource citations.
To capture as many RRIDs as possible, our regex needs to be robust and must generalize to all forms and variations of RRID citations.
Even though RRIDs have a specific syntax that we could specify with a regex right away, we must also consider how authors actually use RRIDs. For instance, in a sample publication (can be found here), the author intentionally omitted the RRID: part when citing the resource, as shown in the figure below:

On the other hand, some research resources have a prefix composed of a set of strings separated by a special character. For instance, RRID:IMSR_JAX:000664 has a slightly different prefix pattern compared to RRID:AB_10564097.
Therefore, to effectively retrieve all RRID citations in the publications, two factors have been considered in this work:
I. Inconsistencies of RRID use by authors and
II. Variants of RRID prefixes.
Three types of inconsistencies occur when authors use RRIDs in their citations: inconsistency of case, of spacing, and of omitted parts of the RRID syntax.
Case inconsistency: upper case vs. lower case. To handle case-related inconsistencies, the ignore-case flag is used with the regex.
Space inconsistency: the most common space inconsistency is a single space after RRID:. For example, in a sample paper (can be found here), the author added a space in 3 instances of RRID citations and no space in one instance, as shown below:
Usually, authors include the RRID: part, and only in fewer cases do they omit it. For this reason, the regex search pattern is formulated to handle both cases (with and without the RRID: part). A more elaborate analysis of omitted vs. included RRID: parts is presented in the RRID analysis section.
Some RRID patterns differ slightly from the majority in their prefix structure, for example RRID:IMSR_JAX:033255 vs. RRID:AB_10564097. To accommodate this, a list of RRID prefixes [31] has been used to capture all variants of RRIDs from the scientific publications.
After considering all of the above factors (all the inconsistencies and variants of RRIDs), a regex search pattern has been formulated for all variants of RRIDs as follows:
RRID: ?SCR_[0-9]{4,}|(?!RRID:) ?SCR_[0-9]{4,}|RRID: ?OMICS_[0-9]{4,}|(?!RRID:) ?OMICS_[0-9]{4,}|RRID: ?AB_[0-9]{4,}|(?!RRID:) ?AB_[0-9]{4,}|RRID: ?CVCL_[0-9]{4,}|(?!RRID:) ?CVCL_[0-9]{4,}|RRID: ?CVCL_[A-Z]+[0-9]+|(?!RRID:) ?CVCL_[A-Z]+[0-9]+|RRID: ?BDSC_[0-9]{4,}|(?!RRID:) ?BDSC_[0-9]{4,}|RRID: ?RGD_[0-9]{4,}|(?!RRID:) ?RGD_[0-9]{4,}|RRID: ?IMSR_JAX:[0-9]{6,}|(?!RRID:) ?IMSR_JAX:[0-9]{6,}|RRID: ?Addgene_[0-9]{4,}|(?!RRID:) ?Addgene_[0-9]{4,}|RRID: ?DGGR_[0-9]{4,}|(?!RRID:) ?DGGR_[0-9]{4,}|RRID: ?EXRC_[0-9]{4,}|(?!RRID:) ?EXRC_[0-9]{4,}|RRID: ?NSRRC_[0-9]{4,}|(?!RRID:) ?NSRRC_[0-9]{4,}|RRID: ?MGI_[0-9]{4,}|(?!RRID:) ?MGI_[0-9]{4,}
Note: The above regex search pattern is a truncated version, after removing RRID prefixes that are not found on the Sci-Crunch.
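As a quick sanity check, the pattern's behaviour on the inconsistency types discussed above can be illustrated on a made-up snippet. The tool names and RRID values below are invented for demonstration, and only the SCR_ alternatives of the full pattern are used:

```python
import re

# Two alternatives from the full pattern: with and without the leading "RRID:" part.
pattern = re.compile(r"RRID: ?SCR_[0-9]{4,}|(?!RRID:) ?SCR_[0-9]{4,}", re.IGNORECASE)

# Invented sample text covering: no space, one space, lower case, omitted RRID: part.
text = ("analysed with ToolA (RRID:SCR_002798), ToolB (RRID: SCR_003070), "
        "ToolC (rrid:scr_014199) and ToolD (SCR_001622)")

# findall returns the full matched substrings; strip the optional leading space.
matches = [m.strip() for m in pattern.findall(text)]
print(matches)
# -> ['RRID:SCR_002798', 'RRID: SCR_003070', 'rrid:scr_014199', 'SCR_001622']
```

All four variants are captured: the `(?!RRID:)` lookahead alternative picks up the bare `SCR_001622`, and the `re.IGNORECASE` flag handles the lower-case citation.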
Using the regex formula specified above, the extraction of RRIDs from the XML publications follows these steps: each XML file is parsed with ElementTree, the parsed tree is converted to a string, and the compiled regex is matched against that string.
RRID_extractor function
The RRID_extractor function is an implementation of the above pipeline. To speed up the RRID extraction process, multiple calls are made to this function using a process pool executor.
# RRID_extractor function -- extracts RRIDs from one Journal
def RRID_extractor(dir_):
    import re
    import xml.etree.ElementTree as ET
    path = '/mnt/storage1/nlp/pubmed_xml/'
    # regular expression formula
    all_rrid_search = '''RRID: ?SCR_[0-9]{4,}|(?!RRID:) ?SCR_[0-9]{4,}|RRID: ?OMICS_[0-9]{4,}|(?!RRID:) ?OMICS_[0-9]{4,}|RRID: ?AB_[0-9]{4,}|(?!RRID:) ?AB_[0-9]{4,}|RRID: ?CVCL_[0-9]{4,}|(?!RRID:) ?CVCL_[0-9]{4,}|RRID: ?CVCL_[A-Z]+[0-9]+|(?!RRID:) ?CVCL_[A-Z]+[0-9]+|RRID: ?BDSC_[0-9]{4,}|(?!RRID:) ?BDSC_[0-9]{4,}|RRID: ?RGD_[0-9]{4,}|(?!RRID:) ?RGD_[0-9]{4,}|RRID: ?IMSR_JAX:[0-9]{6,}|(?!RRID:) ?IMSR_JAX:[0-9]{6,}|RRID: ?Addgene_[0-9]{4,}|(?!RRID:) ?Addgene_[0-9]{4,}|RRID: ?DGGR_[0-9]{4,}|(?!RRID:) ?DGGR_[0-9]{4,}|RRID: ?EXRC_[0-9]{4,}|(?!RRID:) ?EXRC_[0-9]{4,}|RRID: ?NSRRC_[0-9]{4,}|(?!RRID:) ?NSRRC_[0-9]{4,}|RRID: ?MGI_[0-9]{4,}|(?!RRID:) ?MGI_[0-9]{4,}'''
    # compile the regex formula
    PatternW = re.compile(all_rrid_search, re.IGNORECASE)
    xml_rrid_dict = {}       # maps each XML publication to its list of RRID citations
    directory_xml_dict = {}  # maps the journal to the xml_rrid_dict dictionary
    xml_path = path + dir_   # path to the journal
    if os.path.isdir(xml_path):  # check that the directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue  # skip non-XML files
            xml_file_path = os.path.join(xml_path, xml_file)  # create the path to the XML file
            # parse the XML
            tree = ET.parse(xml_file_path)
            root = tree.getroot()
            # convert the XML to a string
            string_root = ET.tostring(root, encoding='utf8').decode('utf8')
            # search for RRIDs in the string
            match = re.findall(PatternW, string_root)
            if len(match) != 0:  # if RRIDs are found, store the result in the dictionaries
                xml_rrid_dict[xml_file] = list(match)
                directory_xml_dict[dir_] = xml_rrid_dict
            # if no RRID is found, do nothing
    return directory_xml_dict
# mapper_func function -- maps each XML publication to its parent Journal
def mapper_func(dir_):
    path = "/mnt/storage1/nlp/pubmed_xml/"
    map_dict = {}
    xml_path = path + dir_  # path to the journal directory
    xml_list = []
    if os.path.isdir(xml_path):  # check that the directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue  # skip non-XML files
            xml_list.append(xml_file)
        map_dict[dir_] = xml_list  # Journal-to-XML mapping
    return map_dict
Using ProcessPoolExecutor, 32 parallel processes are created by making multiple calls to the RRID_extractor function; each process extracts RRID keys from one journal at a time.
# using 32 parallel processes to extract RRIDs
from os import listdir
import concurrent.futures
import operator
import os
# create a dictionary that stores the RRID extraction result
result_dict_2 = {}
# create the list of journal directories
pmed_directory_list = listdir("/mnt/storage1/nlp/pubmed_xml/")
with concurrent.futures.ProcessPoolExecutor() as executor:
    for i in range(0, len(pmed_directory_list[:32]), 32):  # the first 32 Journals -> pmed_directory_list[:32]
        result_list = []
        for dir_rk in pmed_directory_list[i:i+32]:  # create 32 processes
            results = executor.submit(RRID_extractor, dir_rk)  # returns a journal-publication-RRID dictionary
            result_list.append(results)
        for f in concurrent.futures.as_completed(result_list):
            for key_ in list(f.result().keys()):
                result_dict_2[key_] = f.result().get(key_)
print("RRID Search result in =>:", result_dict_2)
The extracted RRID results from the above cell have been saved in the JSON file _3.Journal_xml_RRID_final_Result.json.
Note: For demonstration purposes, the above cell runs over a small sample of only the first 32 PubMed journals (range(0, len(pmed_directory_list[:32]), 32)).
It would take 3-4 days if the cell ran over all 15k journals (range(0, len(pmed_directory_list[:]), 32)), due to the large number of publications, the long regex search pattern, internet speed, etc.
# read the previously saved file from the RRID extraction
with open("/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_3.Journal_xml_RRID_final_Result.json", "r") as read_file:
    Journal_xml_RRID_final_Result = json.load(read_file)
# create a list of XML files that have RRID citations
xml_with_RRID_list = []
for journal in list(Journal_xml_RRID_final_Result.keys()):
    for xml in list(Journal_xml_RRID_final_Result.get(journal).keys()):
        xml_with_RRID_list.append(xml)
# create a list of all RRID keys
all_rrid_list = []
for journal in list(Journal_xml_RRID_final_Result):
    for rrid_list in list(Journal_xml_RRID_final_Result.get(journal).values()):
        for rrid_key in rrid_list:
            all_rrid_list.append(rrid_key)
print(f"{len(all_rrid_list)} RRID citations in {len(xml_with_RRID_list)} publications, in {len(Journal_xml_RRID_final_Result)} Journals.")
# How many unique RRID citations
import numpy as np
# list out unique RRID keys from all_rrid_list
resource_name, resource_counts = np.unique(all_rrid_list, return_counts=True)
Resource_count_dict = dict(zip(resource_name, resource_counts))
unique_rrid_list = list(Resource_count_dict.keys())
print(f"Out of the {len(all_rrid_list)} RRIDs, {len(unique_rrid_list)} RRID citations are unique. ")
What portion of the unique RRID keys is found on the Sci-Crunch?
To check what portion of the 45,220 unique RRID keys is found on the Sci-Crunch, each RRID key is looked up on the Sci-Crunch resolver and the result is retrieved using web scraping. Some of the RRIDs are not found on the Sci-Crunch, returning a 404 error as shown here.
import requests
import re
from bs4 import BeautifulSoup as soup
# Sci-Crunch resolver address
resolver_url = "https://scicrunch.org/resolver/"
# create lists that store RRIDs found/not found on the Sci-Crunch
rrids_found_on_resolver = []
rrids_not_found_on_resolver = []
# for each unique RRID key, web-scrape the result page of the Sci-Crunch resolver
for rrid_ in unique_rrid_list:
    # create the link
    resource_url = resolver_url + rrid_
    # retrieve the text of the Sci-Crunch result page
    resource_page_text = requests.get(resource_url).text
    # parse the page
    page_soup = soup(resource_page_text, "html.parser")
    # for RRIDs found on the Sci-Crunch, the HTML has a div with the "id":"data_info" property
    info_container = page_soup.find("div", {"id": "data_info"})
    # for RRIDs NOT found on the Sci-Crunch, the HTML has a div with the class_="error-v3" property
    info_container2 = page_soup.find("div", class_="error-v3")
    if info_container != None:  # RRID found
        print(f"{rrid_} is found at {resource_url}")
        rrids_found_on_resolver.append(rrid_)
    elif info_container2 != None:  # RRID not found
        print(f"{rrid_} is NOT found at {resource_url}")
        rrids_not_found_on_resolver.append(rrid_)
    else:  # exception: print the page response
        print(resource_page_text)
    break  # demonstration only: stop after the first RRID
Note: I initially tried to do the web scraping with 32 parallel processes, making 32 simultaneous requests to the Sci-Crunch resolver. Since then the server has probably been blacklisted, which explains the 403 Forbidden responses. Therefore, I did the web scraping on my local machine, saved the results, and uploaded them to the Hamilton server.
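A gentler alternative is to throttle sequential requests instead of parallelizing them. The sketch below assumes the resolver tolerates roughly one request per second; the `Throttle` class, the `polite_get` helper, and the `min_interval` value are illustrative, not part of the original code:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough so calls are at least min_interval apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

def polite_get(session, url, throttle):
    # Hypothetical helper: one throttled request at a time.
    throttle.wait()
    return session.get(url, timeout=30)

# Usage sketch (not executed here):
# import requests
# throttle = Throttle(min_interval=1.0)
# with requests.Session() as s:
#     for rrid_ in unique_rrid_list:
#         resp = polite_get(s, "https://scicrunch.org/resolver/" + rrid_, throttle)
```

Reusing one `requests.Session` also keeps the TCP connection alive between look-ups, which is friendlier to the server than opening a new connection per RRID.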
# read the results of the web scraping
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_5.found_rrid_list_on_SciCrunch.txt', 'r') as file:
    found_rrid_list = sj.load(file)
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_6.missing_rrid_list_on_SciCrunch.txt', 'r') as file:
    missing_rrid_list = sj.load(file)
# calculate the percentage of RRIDs found on the Sci-Crunch
Result = len(found_rrid_list) / len(unique_rrid_list) * 100
print(f"From {len(unique_rrid_list)} unique RRIDs, {len(found_rrid_list)} are found on the Sci-Crunch resolver. \nThat is {Result} %")
As stated in the regular expression section above, some authors did not include the RRID: part. When a resource citation lacks the RRID: part, it is difficult to know the author's intention (whether the author is citing a research resource identifier (RRID) or something else).
RRID-Part?
# RRIDs that are found on the Sci-Crunch
found_rrid_list_with_RRID_part = []    # RRIDs that have the RRID part, like RRID:Prefix_xxxxxx
found_rrid_list_with_noRRID_part = []  # RRIDs that have NO RRID part, like Prefix_xxxxxx
for rrid_ in found_rrid_list:
    if 'RRID' in rrid_:  # if there is an RRID part
        found_rrid_list_with_RRID_part.append(rrid_)
    else:                # if there is NO RRID part
        found_rrid_list_with_noRRID_part.append(rrid_)
# RRIDs that are NOT found on the Sci-Crunch
missing_list_with_RRID_part = []    # RRIDs that have the RRID part, like RRID:Prefix_xxxxxx
missing_list_with_noRRID_part = []  # RRIDs that have NO RRID part, like Prefix_xxxxxx
for rrid_ in missing_rrid_list:
    if 'RRID' in rrid_:  # if there is an RRID part
        missing_list_with_RRID_part.append(rrid_)
    else:                # if there is NO RRID part
        missing_list_with_noRRID_part.append(rrid_)
print(f'Of the {len(found_rrid_list)} RRIDs found on the Sci-Crunch, {len(found_rrid_list_with_noRRID_part)} have NO RRID part: {len(found_rrid_list_with_noRRID_part)/len(found_rrid_list)*100} %')
print(f'Of the {len(found_rrid_list)} RRIDs found on the Sci-Crunch, {len(found_rrid_list_with_RRID_part)} have an RRID part: {len(found_rrid_list_with_RRID_part)/len(found_rrid_list)*100} % \n')
print(f'Of the {len(missing_rrid_list)} RRIDs NOT found on the Sci-Crunch, {len(missing_list_with_noRRID_part)} have NO RRID part: {len(missing_list_with_noRRID_part)/len(missing_rrid_list)*100} %')
print(f'Of the {len(missing_rrid_list)} RRIDs NOT found on the Sci-Crunch, {len(missing_list_with_RRID_part)} have an RRID part: {len(missing_list_with_RRID_part)/len(missing_rrid_list)*100} %')
The above result indicates that more than half of the citations (56%) did not include the RRID: part. Is this true? What was the authors' intention: were they citing research resources using RRIDs, or citing something else? Do RRID citations appear the same way in the XML files (used in this project) as in the PDF files?
To answer these questions, a sample of actual papers has been investigated in four scenarios:
i. RRIDs found on the Sci-Crunch with the RRID part:
In a sample of 8 publications (link provided) containing RRIDs that are found on the Sci-Crunch and include the RRID part, the following has typically been observed:
ii. RRIDs found on the Sci-Crunch without the RRID part:
In a sample of 8 publications containing RRIDs that are found on the Sci-Crunch but lack the RRID part, the following has typically been observed:
iii. RRIDs NOT found on the Sci-Crunch with the RRID part:
In a sample of 5 publications containing RRIDs that are NOT found on the Sci-Crunch but include the RRID part, the following has been observed:
iv. RRIDs NOT found on the Sci-Crunch without the RRID part:
In a sample of 6 publications containing resource citations that are NOT found on the Sci-Crunch and also lack the RRID part, the following has typically been observed:
From the above four cases, we can draw the following conclusions:
Most of the time, authors have included the RRID part.
If the RRID part is included and the resource is found on the Sci-Crunch, the author most likely used RRIDs intentionally.
If the RRID part is included but the resource is not found on the Sci-Crunch, the author intentionally used RRIDs but probably made a mistake while citing the RRID key.
If the RRID part is not included and the resource is not found on the Sci-Crunch, the resource is most likely a non-RRID resource and the author did not mean to use RRIDs.
Note: No cases have been found where the author cited a non-RRID resource that is nevertheless found on the Sci-Crunch.
Types of research resources
The type of a research resource refers to its category: software/database, antibody, cell line, plasmid, etc. The type is specified by the prefix of the RRID key. From the list of 42,975 unique RRIDs, 16 unique types of resources have been found, as listed below:
# define a function that extracts the type of resource (RRID prefix)
def unique_rrids(all_rrid_list):  # takes the list of all RRIDs extracted from the publications in 660 journals
    rrid_key_list = []
    for str_ in all_rrid_list:  # for each RRID in all_rrid_list
        if type(str_) == str:
            if 'IMSR_JAX' in str_:   # if 'IMSR_JAX' is in the RRID key
                key_ = str_.split(':', 1)[0].strip().upper()
            elif 'Addgene' in str_:  # if 'Addgene' is in the RRID key
                key_ = str_.split('_', 1)[0].strip()
            else:
                key_ = str_.split('_', 1)[0].strip().upper()
            if key_ not in rrid_key_list:
                rrid_key_list.append(key_)
    unique_key_list = []  # stores the unique resource types (RRID prefixes)
    for rrid_key in rrid_key_list:
        if ':' in rrid_key:
            unique_key = rrid_key.split(':', 1)[1].strip()
            if unique_key not in unique_key_list:
                unique_key_list.append(unique_key)
        else:
            if rrid_key not in unique_key_list:
                unique_key_list.append(rrid_key)
    return unique_key_list
uniqe_rrid_list = unique_rrids(all_rrid_list)
print(f'We have {len(uniqe_rrid_list)} distinct types of Resources! \n\nThese are:')
uniqe_rrid_list
Note: Some types of resources are not found on the Sci-Crunch. These are:
From the 152,598 research resource citations throughout 660 journals, the share of each resource type is visualized as a pie chart.
Steps to the pie chart: create the journal-to-RRID-list mapping, extract the resource-type prefix from each of the 152,598 RRID citation instances, count the prefixes, and plot their shares.
# create the Journal -> RRID_list mapping
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_3.Journal_xml_RRID_final_Result.json', 'r') as file:
    RRIDs_final_Result = sj.load(file)
# create the Journal->RRID mapping from the Journal->XML->RRID mapping
Journal_RRID_dict = {}
for journal in list(RRIDs_final_Result):
    # un-nest the inner lists
    Journal_RRID_dict[journal] = sum(list(RRIDs_final_Result.get(journal).values()), [])
# initialize a list that stores all instances of prefixes (resource types)
all_resource_types = []
for j in list(Journal_RRID_dict):
    resource_type_list = []
    for str_ in Journal_RRID_dict.get(j):
        if 'RRID' in str_:  # the resource citation has an RRID part
            if 'IMSR_JAX' in str_:  # for the RRID:IMSR_JAX:000664 format
                # strip off the RRID part
                imsr_jax_num = str_.split(":", 1)[1].strip()
                # separate the IMSR_JAX prefix
                imsr_jax = imsr_jax_num.split(":", 1)[0].strip().upper()
                resource_type_list.append(imsr_jax)
            elif 'Addgene' in str_:  # for the RRID:Addgene_000664 format
                # strip off the RRID part
                addgene_num = str_.split(':', 1)[1].strip()
                # separate the Addgene prefix
                addgene = addgene_num.split('_', 1)[0].strip().upper()
                resource_type_list.append(addgene)
            else:  # for the RRID:prefix_000664 format
                # strip off the RRID part
                prefix_num = str_.split(':', 1)[1].strip()
                # separate the prefix
                prefix = prefix_num.split('_', 1)[0].strip().upper()
                resource_type_list.append(prefix)
        else:  # the resource citation does not have an RRID part
            if 'IMSR_JAX' in str_:  # for the IMSR_JAX:000664 format
                # separate the IMSR_JAX prefix
                imsr_jax = str_.split(":", 1)[0].strip().upper()
                resource_type_list.append(imsr_jax)
            elif 'Addgene' in str_:  # for the Addgene_000664 format
                # separate the Addgene prefix
                addgene = str_.split('_', 1)[0].strip().upper()
                resource_type_list.append(addgene)
            else:  # for all other cases, prefix_1232548
                # separate the prefix
                prefix = str_.split('_', 1)[0].strip().upper()
                resource_type_list.append(prefix)
    all_resource_types.extend(resource_type_list)
# count the resource types
import numpy as np
# count each type of resource
rsource_typ_labels, rsource_typ_counts = np.unique(all_resource_types, return_counts=True)
# store the count of each resource type in a dict
Resource_count_dict = dict(zip(rsource_typ_labels, rsource_typ_counts))
# sort the resource counts in descending order
Resource_count_dict_sorted = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True))
# delete exceptions
del Resource_count_dict_sorted['RRID:AB']
del Resource_count_dict_sorted['IMSR']
# add the less frequent resource types together
others = ['MMRRC', 'RGD', 'DGGR', 'MGI', 'EXRC', 'NXR', 'NSRRC', 'TSC', 'AGSC']
sum_others_count = 0
for key in others:
    sum_others_count += Resource_count_dict_sorted.get(key)
# take the top 6 resource types by count
Top_6_resource_typs = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True)[:6])
# add the combined count of the less frequent resources to the dict
Top_6_resource_typs['Others'] = sum_others_count
label = list(Top_6_resource_typs.keys())
resource_type_counts = list(Top_6_resource_typs.values())
# visualize the share of resources as a pie chart
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7), subplot_kw=dict(aspect="equal"))
# data to plot
label = list(Top_6_resource_typs.keys())
explode = (0.1, 0.1, 0.09, 1, 1, 1, 1)  # offset for each of the 7 slices
plt.pie(resource_type_counts, explode=explode, autopct='%1.1f%%', shadow=False, startangle=180)
# annotate
plt.legend(['AB - Antibody',
            'SCR - Software/Database',
            'CVCL - Cell Line',
            'BDSC - Organism',
            'IMSR - Mouse Strain',
            'ADDGENE - Plasmids',
            'OTHERS - Various Organisms'], loc="best")
ax.set_title("Share of Resource types in all Journals")
plt.axis('equal')
fig.savefig("Plot: 2.Share of Resources.png", dpi=130)
plt.show()
# read the previously created CSV into a pandas data frame
Journal_resource_count = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_3.Journal_article_title_resource_count.csv')
# aggregate: sum all RRID citations in each journal
Journal_resource_count_agg = Journal_resource_count.groupby('Journal').sum()
# calculate the total RRID citations in each journal
Journal_resource_count_agg["Total RRID in Journal"] = Journal_resource_count_agg.sum(axis=1)
# sort by total RRID count
Journal_resource_count_agg_sorted = Journal_resource_count_agg.sort_values("Total RRID in Journal", ascending=False)
# take the top 10 journals by RRID count
Top_10_Journal_resource_count_agg = Journal_resource_count_agg_sorted.head(10)
Top_10_Journal_resource_count_agg
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
# plot the top 10 Journals by RRID citation count
RRID_Count_ind = Top_10_Journal_resource_count_agg['Total RRID in Journal'].index
RRID_Count_val = Top_10_Journal_resource_count_agg['Total RRID in Journal'].values
ax.bar(RRID_Count_ind, RRID_Count_val)
# annotate the plot
ax.set_xticklabels(RRID_Count_ind, rotation=15)
ax.set_title("Top 10 journals by RRID citation count")
ax.set_ylabel("RRID Count")
ax.set_xlabel("Journals")
fig.savefig("Plot: 3.Top 10 Journals by RRID Count.png", dpi = 130)
plt.show()
The graph above shows the total RRID citation counts of the 10 journals with the most RRID citations. One might also be interested in the share of each resource type within each of these 10 journals; the result is shown in the stacked bar graph below.
#Using the data frame Top_10_Journal_resource_count_agg
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
# plot AB - Antibody
AB_ind = Top_10_Journal_resource_count_agg['AB Count'].index
AB_val = Top_10_Journal_resource_count_agg['AB Count'].values
ax.bar(AB_ind, AB_val , label="AB-Antibody")
# Plot Software /DB
SCR_ind = Top_10_Journal_resource_count_agg['SCR Count'].index
SCR_val = Top_10_Journal_resource_count_agg['SCR Count'].values
ax.bar(SCR_ind, SCR_val,bottom= AB_val, label="SCR-Software/Database" )
# plot CVCL Count
CVCL_ind = Top_10_Journal_resource_count_agg['CVCL count'].index
CVCL_val = Top_10_Journal_resource_count_agg['CVCL count'].values
ax.bar(CVCL_ind, CVCL_val,bottom= AB_val +SCR_val , label="CVCL-Cell Line" )
# BDSC Count
BDSC_ind = Top_10_Journal_resource_count_agg['BDSC Count'].index
BDSC_val = Top_10_Journal_resource_count_agg['BDSC Count'].values
ax.bar(BDSC_ind, BDSC_val,bottom= AB_val +SCR_val+CVCL_val , label="BDSC-Organism" )
# IMSR Count
IMSR_ind = Top_10_Journal_resource_count_agg['IMSR Count'].index
IMSR_val = Top_10_Journal_resource_count_agg['IMSR Count'].values
ax.bar(IMSR_ind, IMSR_val,bottom= AB_val +SCR_val+CVCL_val+BDSC_val , label="IMSR-Mouse Strain" )
# Addgene Count
Add_ind = Top_10_Journal_resource_count_agg['Addgene Count'].index
Add_val = Top_10_Journal_resource_count_agg['Addgene Count'].values
ax.bar(Add_ind, Add_val,bottom= AB_val +SCR_val+CVCL_val+BDSC_val+IMSR_val , label="Addgene-Plasmids" )
# annotate the plot
ax.set_xticklabels(Top_10_Journal_resource_count_agg['AB Count'].index, rotation=15)
ax.set_title("Share of Resources in top 10 journals")
ax.set_ylabel("RRID Count")
ax.set_xlabel("Journals")
ax.legend()
fig.savefig("Plot: 4.Share of Resources in top 10 journals.png", dpi = 130)
plt.show()
The above result shows the share of each resource type only for the top 10 journals by RRID count. Here, the average number of citations of each research resource type across all 660 journals is calculated as follows:
# calculate sum of each type of resource
Journal_resource_agg = Journal_resource_count.groupby('Journal').sum()
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
# Average AB - antibody citation in all Journals
AB_Journal_resource_agg_avg = Journal_resource_agg[["AB Count"]].mean()
ax.bar("Antibody", AB_Journal_resource_agg_avg)
# Average software - SCR citation in all Journals
SCR_Journal_resource_agg_avg = Journal_resource_agg[["SCR Count"]].mean()
ax.bar("Software/DB", SCR_Journal_resource_agg_avg)
# Average Cell Line - CVCL citation in all Journals
CVCL_Journal_resource_agg_avg = Journal_resource_agg[["CVCL count"]].mean()
ax.bar("CVCL", CVCL_Journal_resource_agg_avg)
# Average BDSC citation in all Journals
BDSC_Journal_resource_agg_avg = Journal_resource_agg[["BDSC Count"]].mean()
ax.bar("BDSC", BDSC_Journal_resource_agg_avg)
# Average IMSR citation in all Journals
IMSR_Journal_resource_agg_avg = Journal_resource_agg[["IMSR Count"]].mean()
ax.bar("IMSR", IMSR_Journal_resource_agg_avg)
# Average Addgene citation in all Journals
Add_Journal_resource_agg_avg = Journal_resource_agg[["Addgene Count"]].mean()
ax.bar("Addgene", Add_Journal_resource_agg_avg)
ax.set_title("Avg research resource type citation in all Journals ")
ax.set_ylabel("Average RRID ")
ax.set_xlabel("Popular research resources ")
fig.savefig("Plot: 5.Avg research resource type citation in all Journals.png", dpi = 130)
plt.show()
RRID Per Capita
Since the number of publications is not the same in all journals, RRID per capita is used to measure the density of RRID citations in each journal:
RRID_Percapita = Number of RRID Citations in a Journal / Total number of XML publications in a Journal
# read files created before
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_1.Journal_xml_mapping_dict.json', "r") as read_file:
    Journal_xml_mapping = json.load(read_file)
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_7b.Journal_RRID_count.json', "r") as read_file:
    Journal_RRID_count = json.load(read_file)
# calculate the RRID per capita for each journal that has RRIDs
rrid_percap_dict = {}
for journal in Journal_xml_mapping:
    if journal in Journal_RRID_count:
        rrid_percap_dict[journal] = round(Journal_RRID_count[journal] / len(Journal_xml_mapping[journal]), 2)
# top 10 journals by RRID per capita
Top_10_Journals_RRID_percapita = dict(sorted(rrid_percap_dict.items(), key=lambda item: item[1], reverse=True)[:10])
import matplotlib.pyplot as plt
# style of plot
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
x = list(Top_10_Journals_RRID_percapita.keys())
y = list(Top_10_Journals_RRID_percapita.values())
ax.barh(x, y, align='center')
# annotate the plot
ax.set_yticklabels(x)
ax.set_ylabel("Journals")
ax.set_xlabel("RRID per capita")
ax.set_title("Top 10 Journals by RRID Per Capita")
ax.invert_yaxis()
fig.savefig("Plot: 6.Top 10 Journals by RRID Percapita.png", dpi = 130)
plt.show()
# read the csv created above into a pandas data frame
import pandas as pd
Journal_resource_count = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_3.Journal_article_title_resource_count.csv')
# aggregate sum - add all rrid citations in each journal
Journal_resource_count_agg = Journal_resource_count.groupby('Journal').sum()
# sort by total software resource count
Journal_resource_count_agg_sorted = Journal_resource_count_agg.sort_values("SCR Count", ascending = False)
Journal_resource_count_agg_sorted
# take the top10 Journals by software Count
Top_10_Journal_SCR_count_agg = Journal_resource_count_agg_sorted[["SCR Count"]].head(10)
# using the data frame Top_10_Journal_SCR_count_agg
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
# plot SCR software counts
SCR_ind = Top_10_Journal_SCR_count_agg['SCR Count'].index
SCR_val = Top_10_Journal_SCR_count_agg['SCR Count'].values
ax.barh(SCR_ind, SCR_val , label="SCR Count")
# annotate the plot
ax.set_yticklabels(SCR_ind)
# annotate
ax.set_title("Top 10 Journals by Software citation Count")
ax.set_xlabel("RRID Counts")
ax.set_ylabel("Journals")
ax.invert_yaxis()
fig.savefig("Plot: 7.Top 10 Journals by Software Resource Count.png", dpi = 130)
plt.show()
A given resource might be cited multiple times in a given journal. To correct for this, each software resource has been counted only once per journal, and the resulting distinct counts are plotted in the bar graph below.
# count distinct software resources cited in each journal
Journal_unique_scr_count = {}
for journal_ in Journal_RRID_dict:
    scr_list = []
    for rrid_ in Journal_RRID_dict.get(journal_):
        if 'SCR' in rrid_ and rrid_ not in scr_list:
            scr_list.append(rrid_)
    Journal_unique_scr_count[journal_] = len(scr_list)
# take the top 10 journals by unique software resource citation count
Top10_Journals_by_unique_scr_count = dict(sorted(Journal_unique_scr_count.items(), key=lambda item: item[1], reverse=True)[:10])
import matplotlib.pyplot as plt
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14,7))
ax.barh(list(Top10_Journals_by_unique_scr_count.keys()), list(Top10_Journals_by_unique_scr_count.values()))
# annotate
ax.set_title("Top 10 Journals by distinct Software Resource Count")
ax.set_xlabel("RRID Counts")
ax.set_ylabel("Journals")
ax.invert_yaxis()
fig.savefig("Plot: 8 Top 10 Journals by distinct Software Resource Count.png", dpi = 130)
plt.show()
# read data from csv using panda df
import pandas as pd
RRID_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_1.Journal_xml_rriCount_ePubDate.csv')
Software_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_2.Journal_xml_SCR_RRID_with_article_title.csv')
# set index to date time
RRID_citations_by_publication_date= RRID_citations_by_publication_date.set_index('epub_Date')
Software_citations_by_publication_date = Software_citations_by_publication_date.set_index('epub_Date')
# change the index to a datetime object
RRID_citations_by_publication_date.index = pd.to_datetime(RRID_citations_by_publication_date.index)
Software_citations_by_publication_date.index = pd.to_datetime(Software_citations_by_publication_date.index)
#Sort by Date
RRID_citations_by_publication_date = RRID_citations_by_publication_date.sort_index(ascending = True)
Software_citations_by_publication_date = Software_citations_by_publication_date.sort_index(ascending = True)
Software_citations_by_publication_date.head()
#Software_citations_by_publication_date
#resample('M') -- Monthly average
#resample('Y') -- Annual average
# resample monthly ('M')
Monthly_average_RRID_citations = RRID_citations_by_publication_date.resample('M')[["RRID Count"]].mean().fillna(0)
Monthly_average_software_citations = Software_citations_by_publication_date.resample('M')[["RRID Count"]].mean().fillna(0)
# plotting our data
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
# fig size
fig, ax = plt.subplots(figsize=(14,7))
# plot RRID citation counts in all Journals
dates = Monthly_average_RRID_citations.index
rrids = Monthly_average_RRID_citations.values
ax.plot(dates, rrids, alpha=0.9, label="all RRIDs (including Software)")
# plot software citation counts in all Journals
scr_rrids = Monthly_average_software_citations.values
scr_dates = Monthly_average_software_citations.index
ax.plot(scr_dates, scr_rrids, color='r', alpha=0.9, label="software")
# annotate the plot
ax.legend(loc="upper left")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Monthly Average RRID count')
ax.set_title("RRID vs Software monthly citation in all Journals")
fig.savefig("Plot: 9M.RRID Vs Software Monthly average citation all Journals.png", dpi = 250)
plt.show()
RRID_citations_by_publication_date_PLOS =RRID_citations_by_publication_date[RRID_citations_by_publication_date['Journal'] == 'PLoS_One']
Software_citations_by_publication_date_PLOS = Software_citations_by_publication_date[Software_citations_by_publication_date['Journal'] == 'PLoS_One']
Yearly_average_RRID_citations_plos = RRID_citations_by_publication_date_PLOS.resample('Y')[["RRID Count"]].mean().fillna(0)
Yearly_average_software_citations_plos = Software_citations_by_publication_date_PLOS.resample('Y')[["RRID Count"]].mean().fillna(0)
# plotting our data
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
# fig size
fig, ax = plt.subplots(figsize=(14,7))
# PLOT RRID citation counts in PLoS One
dates = Yearly_average_RRID_citations_plos.index
rrids = Yearly_average_RRID_citations_plos.values
ax.plot(dates,rrids , marker="o", alpha=0.5, label="all RRIDs (including Software)")
# PLOT software citation counts in PLoS One
scr_rrids = Yearly_average_software_citations_plos.values
scr_dates = Yearly_average_software_citations_plos.index
ax.plot(scr_dates, scr_rrids, marker="o", color='r', alpha=0.5, label="software")
# annotate the plot
ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Annual Average RRID count ')
ax.set_title("Use of RRIDs over the years in PLoS One Journal")
fig.savefig("Plot: 10.RRID vs Software annual average citation in PLoS One.png", dpi = 250)
plt.show()
RRID_citations_by_publication_date_eLife =RRID_citations_by_publication_date[RRID_citations_by_publication_date['Journal'] == 'eLife']
Software_citations_by_publication_date_eLife = Software_citations_by_publication_date[Software_citations_by_publication_date['Journal'] == 'eLife']
Monthly_average_RRID_citations_eLife = RRID_citations_by_publication_date_eLife.resample('M')[["RRID Count"]].mean().fillna(0)
Monthly_average_software_citations_eLife = Software_citations_by_publication_date_eLife.resample('M')[["RRID Count"]].mean().fillna(0)
# plotting our data
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
# fig size
fig, ax = plt.subplots(figsize=(14,7))
# plot RRID citation counts in eLife
dates = Monthly_average_RRID_citations_eLife.index
rrids = Monthly_average_RRID_citations_eLife.values
ax.plot(dates, rrids, alpha=0.9, label="all RRIDs (including Software)")
# plot software citation counts in eLife
scr_rrids = Monthly_average_software_citations_eLife.values
scr_dates = Monthly_average_software_citations_eLife.index
ax.plot(scr_dates, scr_rrids, color='r', alpha=0.9, label="software")
# annotate the plot
ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Monthly Average RRID count ')
ax.set_title("RRIDs over the years in eLife Journal")
fig.savefig("Plot: 11M.RRID vs Software Monthly average citation in eLife.png", dpi = 250)
plt.show()
# read data from csv using panda df
import pandas as pd
SCR_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_2.Journal_xml_SCR_RRID_with_article_title.csv')
# set index to date time
SCR_citations_by_publication_date= SCR_citations_by_publication_date.set_index('epub_Date')
# change the index to a datetime object
SCR_citations_by_publication_date.index = pd.to_datetime(SCR_citations_by_publication_date.index)
#Sort by Date
SCR_citations_by_publication_date = SCR_citations_by_publication_date.sort_index(ascending = True)
SCR_citations_by_publication_date.head(6)
#Software_citations_by_date
The first software resource cited using an RRID was RRID:SCR_013827, a statistics calculator. Interestingly, the publication was published by the Resource Identification Initiative (RII), the same initiative that introduced RRIDs in 2014. The other results prior to 2014, SCR_00085 and SCR_00479, are not RRIDs; they refer to ligands.
All unique RRIDs have been counted from the full list of 152,598 extracted RRIDs.
# count all unique RRIDs from all_rrid_list
import numpy as np
unique_labels, unique_counts = np.unique(all_rrid_list, return_counts=True)
Resource_count_dict = dict(zip(unique_labels, unique_counts))
# take the top 10 resources which has the most frequency of citation
Top10_Resource_count_dict = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True)[:10])
# visualize popular software
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
cloud = WordCloud(max_font_size=400, width=1078,
                  height=720,
                  background_color="black",
                  colormap="hsv").generate_from_frequencies(Resource_count_dict)
plt.figure(figsize=(12,14))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
cloud.to_file('Plot: 15.Popular software.png')
plt.show()
The most popular software tools cited using RRIDs are:
GraphPad Prism (SCR_002798): statistical analysis software.
ImageJ (SCR_003070): open-source Java-based image processing software.
MATLAB (SCR_001622): multi-paradigm numerical computing environment.
Fiji (SCR_002285): a software package distributed as a distribution of ImageJ.
Popular_software = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_4.Popular softwares citation count in journals.csv')
# set index to date time
Popular_software= Popular_software.set_index('epub_Date')
# change the index to a datetime object
Popular_software.index = pd.to_datetime(Popular_software.index)
#Sort by Date
Popular_software_sorted = Popular_software.sort_index(ascending = True)
Popular_software_sorted.head()
# Calculate the average citations
avg_Prism = Popular_software_sorted.resample('Y')[["Prism"]].mean().fillna(0)
avg_ImageJ = Popular_software_sorted.resample('Y')[["ImageJ"]].mean().fillna(0)
avg_Matlab = Popular_software_sorted.resample('Y')[["Matlab"]].mean().fillna(0)
avg_Fiji = Popular_software_sorted.resample('Y')[["Fiji"]].mean().fillna(0)
avg_R = Popular_software_sorted.resample('Y')[["R"]].mean().fillna(0)
# plotting our data
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
# fig size
fig, ax = plt.subplots(figsize=(14,7))
# plot Prism software citation over years
Prism_dates = avg_Prism.index
Prism_rrids = avg_Prism.values
ax.plot(Prism_dates,Prism_rrids , alpha=1, label="1. Prism")
# plot ImageJ software citation over years
ImageJ_dates = avg_ImageJ.index
ImageJ_rrids = avg_ImageJ.values
ax.plot(ImageJ_dates,ImageJ_rrids , alpha=1, label="2. ImageJ ")
# plot Annual_avg_Matlab software citation over years
Matlab_dates = avg_Matlab.index
Matlab_rrids = avg_Matlab.values
ax.plot(Matlab_dates,Matlab_rrids , alpha=1, label="3. Matlab ")
# plot Fiji software citation over years
Fiji_dates = avg_Fiji.index
Fiji_rrids = avg_Fiji.values
ax.plot(Fiji_dates,Fiji_rrids , alpha=1, label="4. Fiji ")
# plot R software citation over years
R_dates = avg_R.index
R_rrids = avg_R.values
ax.plot(R_dates,R_rrids , alpha=1, label="5. R ")
# annotate the plot
ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel(' avg. citation count ')
ax.set_title("Avg. citation of popular software over years ")
ax.set_yscale('linear')
fig.savefig("Plot: 12.citation of popular software over years.png", dpi = 250)
plt.show()
# group by Journal total software citations
Popular_software_sorted_agg = Popular_software_sorted.groupby('Journal').sum()
Popular_software_sorted_agg
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))
# Average Prism citation in all Journals
Prism_software_sorted_agg = Popular_software_sorted_agg[["Prism"]].mean()
ax.bar("Prism", Prism_software_sorted_agg)
# Average ImageJ citation in all Journals
ImageJ_software_sorted_agg = Popular_software_sorted_agg[["ImageJ"]].mean()
ax.bar("ImageJ", ImageJ_software_sorted_agg)
# Average Matlab citation in all Journals
Matlab_software_sorted_agg = Popular_software_sorted_agg[["Matlab"]].mean()
ax.bar("Matlab", Matlab_software_sorted_agg)
# Average Fiji citation in all Journals
Fiji_software_sorted_agg = Popular_software_sorted_agg[["Fiji"]].mean()
ax.bar("Fiji", Fiji_software_sorted_agg)
# Average R citation in all Journals
R_software_sorted_agg = Popular_software_sorted_agg[["R"]].mean()
ax.bar("R", R_software_sorted_agg)
ax.set_title("Average citation count of popular software in Journals")
ax.set_ylabel("Average RRID ")
fig.savefig("Plot: 13.Average citation count of popular software in Journals.png", dpi = 250)
plt.show()
Popular_software_sorted_agg_sorted = Popular_software_sorted_agg.sort_values(['Prism','ImageJ','Matlab', 'Fiji', 'R'] , ascending = False)
#TOP 10 journals by popular software citation count
top10_Journals_by_Popular_software_citation_count = Popular_software_sorted_agg_sorted.head(10)
top10_Journals_by_Popular_software_citation_count
import matplotlib.pyplot as plt
# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14,7))
# plot Prism software
Prism_ind = top10_Journals_by_Popular_software_citation_count['Prism'].index
Prism_val = top10_Journals_by_Popular_software_citation_count['Prism'].values
ax.bar(Prism_ind, Prism_val , label="Prism")
# Plot ImageJ software
ImageJ_ind = top10_Journals_by_Popular_software_citation_count['ImageJ'].index
ImageJ_val = top10_Journals_by_Popular_software_citation_count['ImageJ'].values
ax.bar(ImageJ_ind, ImageJ_val,bottom= Prism_val, label="ImageJ" )
# plot Matlab Count
Matlab_ind = top10_Journals_by_Popular_software_citation_count['Matlab'].index
Matlab_val = top10_Journals_by_Popular_software_citation_count['Matlab'].values
ax.bar(Matlab_ind, Matlab_val, bottom=Prism_val+ImageJ_val, label="Matlab")
# plot Fiji Count
Fiji_ind = top10_Journals_by_Popular_software_citation_count['Fiji'].index
Fiji_val = top10_Journals_by_Popular_software_citation_count['Fiji'].values
ax.bar(Fiji_ind, Fiji_val, bottom=Prism_val+ImageJ_val+Matlab_val, label="Fiji")
# plot R Count
R_ind = top10_Journals_by_Popular_software_citation_count['R'].index
R_val = top10_Journals_by_Popular_software_citation_count['R'].values
ax.bar(R_ind, R_val, bottom=Prism_val+ImageJ_val+Matlab_val+Fiji_val, label="R")
# annotate the plot
ax.set_xticklabels(top10_Journals_by_Popular_software_citation_count.index, rotation=15)
ax.set_title("Share of popular software in top 10 journals")
ax.set_ylabel("Software Count")
ax.set_xlabel("Journals")
ax.legend()
fig.savefig("Plot: 14.Share of popular software in top 10 journals.png", dpi = 250)
plt.show()
The analysis of the publications above reveals the following facts:
Some authors ignored the RRID-Part of the syntax, introduced unnecessary white space, or made mistakes while citing resources, such as typos or wrong keys.
152,598 RRID citations have been found in 660 PubMed journals. Most of these are recurring citations of 45,220 distinct research resources. 95% of these research resources can be found on SciCrunch, which shows that most RRID citations work well to uniquely identify research resources and thereby support the reproducibility of the researchers' work. The remaining 5% of cases, where the RRIDs are missing on SciCrunch, correspond to authors' mistakes while using RRIDs or to keys that are not RRIDs at all.
Close to 90% of the research resources used in the PubMed journals are antibodies (AB, 69.1%) and software/database resources (SCR, 19.6%). Interestingly, the GigaScience journal uses only software resources; not a single other type of research resource was found there. In all other journals, essentially all kinds of research resources (antibodies, software, cell lines, plasmids, etc.) have been used, in the following proportions: 69.1% antibody (AB), 19.6% software/DB (SCR), 4.5% cell line (CVCL), 2.4% organism (BDSC), 2.3% mouse strain (IMSR), 1.5% plasmid (Addgene), and 0.6% others.
GraphPad Prism, ImageJ, Matlab, Fiji, and R have been used extensively by researchers. This makes sense because, in life-science and biomedical research, researchers often have to analyze images and perform numerical analysis during their experiments. Another interesting insight is that, for some reason, Prism has become very popular since 2019, whereas Matlab has become less popular.
In this project, the publications used for the analysis of research resources are in the XML file format.
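The percentage shares quoted above are simply each resource type's citation count divided by the overall total. A minimal sketch, using illustrative counts scaled to sum to 1,000 (not the project's real totals):

```python
# Illustrative per-type citation counts (scaled to sum to 1,000; the real
# analysis uses the totals aggregated from the journals above).
type_counts = {"AB": 691, "SCR": 196, "CVCL": 45, "BDSC": 24,
               "IMSR": 23, "Addgene": 15, "other": 6}
total = sum(type_counts.values())
# percentage share of each resource type, rounded to one decimal place
shares = {k: round(100 * v / total, 1) for k, v in type_counts.items()}
print(shares)  # AB -> 69.1, SCR -> 19.6, ...
```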
When analyzing authors' use of RRIDs, it was found earlier that the RRID-Part of RRID:prefix_xxxxx is often embedded in an XML tag as an XML attribute, separated from the prefix_xxxx part of the syntax. However, our regex formula captured only those resource citations where the RRID-Part occurs exactly in the RRID:prefix_xxxxx form. This has led to a misleading result suggesting that most authors did not include the RRID-Part in their citations.
These inconsistencies were found by manually comparing a few sample PDF publications with their XML counterparts (as shown in 3. How often do authors use RRIDs intentionally?). In future work, the citation of RRIDs in XML and PDF files can be compared in an automated way: by extracting the attribute information from the XML tag and combining it with the rest of the RRID part, the actual citation pattern, as the author originally wrote it, can be reconstructed.
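Such a reconstruction could be sketched as follows. The markup shape here is a hypothetical illustration (the attribute name rid and the ext-link snippet are assumptions, not the actual PubMed XML): the "RRID" part sits in an attribute while the element text carries only the bare key.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: the "RRID" part is stored as an attribute,
# and the element text carries only the bare key.
xml_snippet = """<p>Images were analyzed with ImageJ
<ext-link ext-link-type="uri" rid="RRID">SCR_003070</ext-link>.</p>"""

# same key prefixes used by the extraction regex in this project
key_pattern = re.compile(r"(AB|SCR|CVCL|BDSC|IMSR|Addgene)_\w+")

root = ET.fromstring(xml_snippet)
for link in root.iter("ext-link"):
    key = (link.text or "").strip()
    # re-attach the attribute part to the key to rebuild the full citation
    if link.get("rid") == "RRID" and key_pattern.fullmatch(key):
        print("RRID:" + key)  # -> RRID:SCR_003070
```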
Furthermore, before creating new RRIDs, it would be useful to perform text mining to check whether an RRID-resembling key has already been assigned to another, non-RRID type of resource. During this project, two papers (1 and 2) were found with the resource citations SCR_00085 and SCR_00479. These keys resemble RRID citations without the RRID-Part, but they are not actually RRIDs; rather, they are ligands.
If any software resource not yet assigned an RRID were, by chance, given one of these keys or a similar one, this would create conflict and ambiguity. Therefore, as future work, text mining for keys should be performed before assigning a new RRID key, to check whether the key is already in use by another, non-RRID type of resource. This is critical given that, so far, only 660 journals out of more than 15,000 have used RRIDs to cite research resources.
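The proposed pre-assignment check could look roughly like this (the tiny corpus and the helper function are hypothetical; a real implementation would scan the full-text XML publications):

```python
import re

# Tiny stand-in corpus; in practice this would be the full-text publications.
corpus = [
    "binding affinity of the ligand SCR_00085 was measured",  # ligand, not an RRID
    "image analysis was performed in ImageJ (RRID:SCR_003070)",
]

# collect every string that merely *looks* like a software RRID key
lookalike = re.compile(r"\bSCR_\d+\b")
reserved_keys = {m for text in corpus for m in lookalike.findall(text)}

def is_safe_new_key(candidate):
    """Reject a candidate RRID key if it already occurs in the literature."""
    return candidate not in reserved_keys

print(is_safe_new_key("SCR_00085"))  # False: already used as a ligand identifier
print(is_safe_new_key("SCR_99999"))  # True: no collision found
```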